T-IDBA: A de novo Iterative de Bruijn Graph Assembler for Transcriptome
Abstract
RNA sequencing based on next-generation sequencing technology is useful for analyzing transcriptomes, discovering novel genes, and studying exon/intron structures. Like de novo genome assembly, de novo transcriptome assembly does not rely on a reference genome or additional annotation. Most, if not all, existing de novo transcriptome assemblers rely heavily on de novo genome assembly techniques without fully exploiting the properties of transcriptomes, and may produce short contigs because of the splicing nature of genes (shared exons) and the repeats that occur in different genes. In this paper, we analyze the properties of the mammalian transcriptome and propose an algorithm to reconstruct expressed isoforms without a reference genome. We extend the iterative de Bruijn graph approach (IDBA) by using paired-end information to solve the problem of long repeats in different genes and the problem of branching within a gene due to alternative splicing. The graph is decomposed into small components, each of which corresponds to a few genes, if not a single gene. The most probable isoforms with sufficient support from the paired-end reads are then found heuristically by depth-first search. In practice, our de novo transcriptome assembler, T-IDBA, substantially outperforms ABySS (one of the newest de novo transcriptome assemblers) in sensitivity and precision on both simulated and real data. The experimental results also match our theoretical analysis of the performance of T-IDBA, which guarantees that most isoforms can be reconstructed as long as their coverage exceeds a certain threshold.
Availability: T-IDBA is available at http://www.cs.hku.hk/~alse/idba/
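The pipeline described in the abstract (build a de Bruijn graph, decompose it into components, search each component depth-first for candidate isoforms) can be illustrated with a minimal sketch. This is not the T-IDBA implementation: it uses a single fixed k rather than iterating over k, it skips error correction, and it omits the paired-end support filter entirely, so every source-to-sink path is reported. The function names (`build_de_bruijn`, `components`, `enumerate_paths`, `assemble`) and the toy sequences are hypothetical, chosen only for illustration.

```python
def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mers that follow it."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            prefix, suffix = read[i:i + k - 1], read[i + 1:i + k]
            graph.setdefault(prefix, set()).add(suffix)
            graph.setdefault(suffix, set())  # register nodes with no out-edges
    return graph


def components(graph):
    """Split the graph into weakly connected components (sets of nodes)."""
    undirected = {node: set(succ) for node, succ in graph.items()}
    for node, succ in graph.items():
        for nxt in succ:
            undirected[nxt].add(node)
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            comp.add(node)
            for neighbour in undirected[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        result.append(comp)
    return result


def enumerate_paths(graph, comp):
    """Depth-first search from every source node; spell out each path found.

    Assumption: no paired-end filtering is applied here, so all paths that
    end where no unvisited successor remains are returned.
    """
    has_pred = {nxt for node in comp for nxt in graph[node]}
    sources = sorted(node for node in comp if node not in has_pred)
    paths = []

    def dfs(node, sequence, on_path):
        successors = sorted(n for n in graph[node] if n not in on_path)
        if not successors:
            paths.append(sequence)
            return
        for nxt in successors:
            dfs(nxt, sequence + nxt[-1], on_path | {nxt})

    for source in sources:
        dfs(source, source, {source})
    return sorted(paths)


def assemble(reads, k):
    """Return the candidate sequences of each component, sorted."""
    graph = build_de_bruijn(reads, k)
    return sorted(enumerate_paths(graph, comp) for comp in components(graph))


# Toy input: two isoforms of one gene (the second skips the middle exon)
# plus a transcript from an unrelated gene.
reads = ["ATGCCGTACTTGA", "ATGCCCTTGA", "GGAACA"]
for candidates in assemble(reads, k=4):
    print(candidates)
```

On this toy input the two isoforms share their first and last exons, so they fall into one component with a branch at the splice site, while the unrelated transcript forms its own component. In a real setting the number of paths can grow quickly with the number of branches, which is where a support criterion such as the paired-end reads mentioned above would be needed to prune the search.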
Similar resources
T-IDBA: A de novo Iterative de Bruijn Graph Assembler for Transcriptome - (Extended Abstract)
RNA sequencing based on next-generation sequencing technology is useful for analyzing transcriptomes, discovering novel genes and studying exon/intron structures. Similar to genome assembly, de novo transcriptome assembly does not rely on a reference genome and additional annotated information. Most, if not all, existing de novo transcriptome assemblers rely heavily on de novo genome assembly t...
De Bruijn Graph based De novo Genome Assembly
Next-generation sequencing (NGS) is an important process that enables inexpensive organization of vast raw sequence data sets compared with traditional sequencing systems or methods. Various aspects of NGS, such as template preparation, sequencing imaging, and genome alignment and assembly, outline genome sequencing and alignment. Consequently, the de Bruijn graph (dBG) is an important mathema...
IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler
The de Bruijn graph assembly approach breaks reads into k-mers before assembling them into contigs. The string graph approach forms contigs by connecting two reads with k or more overlapping nucleotides. Both approaches must deal with the following problems: false-positive vertices, due to erroneous reads; gap problem, due to non-uniform coverage; branching problem, due to erroneous reads and r...
IDBA-tran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels
MOTIVATION: RNA sequencing based on next-generation sequencing technology is effective for analyzing transcriptomes. Like de novo genome assembly, de novo transcriptome assembly does not rely on any reference genome or additional annotation information, but is more difficult. In particular, isoforms can have very uneven expression levels (e.g. 1:100), which make it very difficult to identify low...
IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth
MOTIVATION: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. However, both technologies suffer from the problem that sequencing depth of different regions of a genome or genomes from different species are highly uneven. Most existing genome assemblers usually have an assumption that sequencing...
Publication date: 2013